Introduction

All systems of classroom observation share one common element: 
their dependence upon the observers who use them. Too often, little 
attention has been given to procedures used in training observers 
and the methods by which data are computed and reported in regard to 
observer reliability and validity. Yet without an estimation of the 
accuracy and relevancy of observation data collected, little confidence 
can be placed in the findings they produce. 



Problem 

The purposes of the study were (1) to compare two types of
reliability in the observation of teachers' behavior, (2) to explore the 
relationship between observer reliability and the validity of their 
systematic classroom observations and (3) to investigate the effects 
of training, measured observer beliefs, and the passage of time on 
reliability and validity estimates. 

Procedures

Instrumentation. Scores obtained by the employment of an 
observation system, the Teacher Practices Observation Record (TPOR), 
were used to establish reliability and validity estimates. The TPOR 
is a 62-item sign system which measures the instructional practices 
of a teacher in terms of agreement-disagreement with John Dewey's 
experimentalism. The observation is recorded during a 30-minute 
session which is divided into three ten-minute observation and marking 
periods; each of the sixty-two items is to be considered and then 
checked if the described behavior occurs during the period. Thus the 
observer is required to make 186 discriminations as to the presence 
or absence of the described practices during the total observation 
period. From the observation a descriptive record of teaching behavior 
can be reported in the form of a numerical score ranging from 0 to 186. 
A TPOR score of 93 or above indicates teacher behavior in greater 
agreement than disagreement with experimentalism; a score below 93 
indicates greater disagreement than agreement. 
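
The arithmetic behind the score range and the cut-off can be restated as a short sketch. The function name `interpret_tpor` is hypothetical, and the rule for converting item checks into a score is not given in this section, so the sketch only interprets a finished score.

```python
ITEMS = 62      # items on the TPOR, as stated in the text
PERIODS = 3     # ten-minute marking periods in one 30-minute session

MAX_SCORE = ITEMS * PERIODS    # 186 discriminations per observation
MIDPOINT = MAX_SCORE // 2      # 93, the cut-off given in the text


def interpret_tpor(score: int) -> str:
    """Return the interpretation the text gives for a total TPOR score.

    Hypothetical helper for illustration; it does not model how the
    individual item checks are scored.
    """
    if not 0 <= score <= MAX_SCORE:
        raise ValueError(f"score must lie between 0 and {MAX_SCORE}")
    if score >= MIDPOINT:
        return "greater agreement than disagreement with experimentalism"
    return "greater disagreement than agreement with experimentalism"


if __name__ == "__main__":
    for s in (40, 93, 150):
        print(s, "->", interpret_tpor(s))
```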

Through recognition that observer biases and subjectivity will 
color records of classroom behavior, two instruments were used to 
measure the beliefs of subjects used as observers. The Personal 
Beliefs Inventory (PBI) and Teacher Practices Inventory (TPI)¹ were 
developed to be used in conjunction with the Teacher Practices 
Observation Record and measure fundamental philosophic and educational 
beliefs. High scores indicate agreement with experimentalism; low 
scores indicate rejection of Dewey's philosophy. 

¹For an account of the development of these instruments and the 
Teacher Practices Observation Record, see Bob Burton Brown, The 
Experimental Mind in Education. New York: Harper and Row, 1968. 

Subjects. The subjects of the study were thirty-two experienced 
female elementary teachers selected from a [text illegible]. 
Sixteen of the subjects were trained in the use of the observation 
system; they comprised the trained group. The sixteen remaining 
subjects received no training. 

Training Procedures. The training of observers consisted of five 
two-hour training sessions held over a four-week period; films of 
teachers in unrehearsed classroom situations were used for training 
purposes. Provision was made to give observers immediate feedback on 
agreement. Efforts were made by the trainer to encourage the observer 
subjects (1) to achieve agreement in their responses to the 
observation instrument and (2) to record behavior in terms of the 
theoretical basis of the instrument. 

Data Collection. Two films of classroom teacher behavior (A and 
B) were used for data collection purposes. Each group of subjects 
viewed the two films and recorded the observed behavior twice, once 
approximately ten weeks after training had been completed and then 
again ten weeks after the first viewing session. The TPOR scores 
obtained in the two viewing sessions were used in the analysis of the 
data. 

Data Analysis. The data were first used to compute two types 
of reliability coefficients: (1) Between-observer, the agreement 
between observers of the same teacher behavior, computed as a 
percent of agreement. This coefficient is a ratio of the number of 
responses to which observers agree to the total number of responses 
possible; and (2) Within-observer, the stability of an individual 
observer's responses to the same behavior over a period of time. 
These coefficients were computed by techniques developed by Brown, 
Mendenhall, and Beaver.² In addition, criterion validity coefficients 
were developed by comparing observers' scores with criterion scores. 
Criterion scores were composite scores given the films by the trainer 
and the author of the Teacher Practices Observation Record. The 
validity coefficient was computed using the same procedure as the 
within-observer reliability coefficient. 
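
The ratio definition of agreement given above can be illustrated with a minimal sketch. It is not the Brown, Mendenhall, and Beaver technique, which this section does not reproduce; it is only a plain proportion-of-agreement reading of the stated ratio. The function names are hypothetical, and averaging over all pairs of observers is an assumption made for the illustration.

```python
from itertools import combinations
from typing import Sequence


def percent_agreement(a: Sequence[int], b: Sequence[int]) -> float:
    """Ratio of responses on which two records agree to all responses.

    Each record is a sequence of 0/1 marks (item absent / item checked).
    The same function can compare two observers of one film
    (between-observer), one observer's two viewings (within-observer),
    or an observer's record with a criterion record (validity).
    """
    if len(a) != len(b) or len(a) == 0:
        raise ValueError("records must be non-empty and of equal length")
    agreements = sum(x == y for x, y in zip(a, b))
    return agreements / len(a)


def mean_pairwise_agreement(records: Sequence[Sequence[int]]) -> float:
    """Average percent agreement over every pair of observers.

    Assumption: the text does not say how more than two observers are
    combined; pairwise averaging is one simple choice.
    """
    pairs = list(combinations(records, 2))
    if not pairs:
        raise ValueError("at least two records are required")
    return sum(percent_agreement(a, b) for a, b in pairs) / len(pairs)


if __name__ == "__main__":
    # Toy records of four marks each; a real TPOR record has 186.
    first_viewing = [1, 0, 1, 1]
    second_viewing = [1, 0, 0, 1]
    print(percent_agreement(first_viewing, second_viewing))
```

Using one function for all three comparisons mirrors the statement that the validity coefficient was computed in the same way as the within-observer coefficient; only the pair of records being compared changes.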

The coefficients established by these procedures were used as 
responses in linear multiple regression analysis to investigate the 
effects of training, the effects of measured beliefs and the effects 
of time on the reliability and validity of observers' observation 
scores. Lastly, the relationships between the validity coefficients 
and reliability of observations were examined. 



Findings 

Between-Observer Reliability Coefficients. Between-observer 
reliability coefficients were computed for each film for each 
viewing and for the variables under investigation, the effects of 
training and measured beliefs, and are reported in Table 1. These 
coefficients, ranging from .77 to .86, are comparable with those 
reported for other observation instruments and are remarkably 
uniform. The single identifiable general trend was that untrained 
observers achieved slightly higher coefficients than trained observers. 



²Bob Burton Brown, William Mendenhall, and Robert Beaver, 
"The Reliability of Observations of Teachers' Classroom Behavior," 
Journal of Experimental Education, 36:1-10, Spring, 1968. 

[...] subjects' belief scores fell within a fairly narrow range. 



Also the trained observers' agreement tended to decrease slightly 
over time. The variable of beliefs seemed to have no effect, and 
training had precious little effect, on agreement between observers. 

Clearly, for the Teacher Practices Observation Record, neither 
training nor beliefs have a great effect on between-observer 
agreement. It could be that training was ineffectual in contributing 
to between-observer agreement. Efforts were made in training sessions 
to encourage observers to record behavior in terms of the theoretical 
basis of the instrument, even at the expense of increasing agreement. 
[...] designed to be used by untrained observers; in its development, only 
items which described behavior in clear and concise terms were 
included in the final form. Thus, the composition of the instrument 
itself leads to agreement of responses of observers who share similar 
perceptual frameworks. 

Within-Observer Reliability Coefficients. Through the comparison 
of responses of each observer to each film for the first and second 
viewings, individual within-observer coefficients, the stability of 
an observer's scores, were developed and can be found in Table 2. The 
coefficients range from .34 to .77 for the trained group and from 
.41 to .90 for the untrained group. Mean within-observer coefficients 
for trained and untrained groups were very uniform and are shown in 
Table 3. No variables could be identified which would even partially 
account for the wide variance between individual coefficients. There 
is no question that observers do vary greatly in the stability with 
which they respond to teaching behavior over time; however, the 
variables of training and beliefs did not seem to influence this 
stability for the subjects under investigation. The only factor which 
did seem to affect the within-observer reliability coefficient was the 
film itself. Observers responded in a more stable manner to Film B 
than to Film A. 

TABLE 2 

WITHIN-OBSERVER RELIABILITY COEFFICIENTS 

TABLE 3 

MEAN WITHIN-OBSERVER RELIABILITY COEFFICIENTS 

Criterion Validity Coefficients. Individual validity coefficients 
for each film for each session were computed and are reported in 
Table 4. Mean coefficients appear in Table 5. The validity 
coefficients for these subjects were low with wide variability. 
Within these general limitations variables were identified which would 
account for a statistically significant amount of the variance between 
coefficients. The multiple regression analysis indicated that the 
interaction of training and belief variables affected the validity of 
subjects' observations. The effects of training on observers more in 
agreement with experimentalism had a tendency to produce higher 
validity coefficients. This effect decreased over time. The training of 
observers less in agreement with experimentalism had a slightly 
negative effect on validity. This effect increased over time. 

Comparison of Coefficients. A comparison was made of the 
relationship between the validity coefficients and the within-observer 
reliability coefficients. A significant relationship was identified; 
observers who are more consistent in their recording of observations 
of the same behavior over a period of time also tend to make more 
valid observations. This relationship is slightly accentuated if 
the observer has been trained. 

TABLE 4 

CRITERION-OBSERVER VALIDITY COEFFICIENTS 



TABLE 5 

MEAN CRITERION-OBSERVER VALIDITY COEFFICIENTS 



Conclusions 

For those who gather classroom behavioral data through systematic 
observation, between-observer reliability needs close examination. 
The classic method of obtaining dependable evidence of what happens 
in a classroom has been to train observers to use some type of 
observation instrument, rating scale or check list. The prime 
purpose of training has been to achieve agreement between observers as 
to the behaviors they are recording. Thus when the time arrives 
that observers can agree on what to label the behavior under question, 
they are considered trained and dependable to gather accurate and 
relevant information. The data suggest that this is far from the case. 
If one can find observers who share common perceptual frameworks, 
agreement can be achieved easily, with little or no training. 
[...] possible by selecting a fairly uniform sample to find observers who 
can easily agree, but to get them to observe behavior within a 
particular theoretical framework is a far more difficult task. 

Within-observer reliability would seem a far more [...] 
concept for both practical and theoretical reasons. Observations of 
classroom behavior are expensive and time-consuming methods of 
procuring data. It is difficult and prohibitive in cost to send more than 
one observer into a classroom to collect data. Therefore, that an 
observer remain consistent in recording behavior as he moves from 
classroom to classroom is of more importance than that he reaches 
agreement with other observers at some point in time, providing he 
records data in a manner relevant to the instrument he is using. 

This leads to the problem of the validity of the observations 
he makes. It would seem from the literature devoted to systematic 
classroom observation that validity has been assumed to be achieved 
automatically with between-observer agreement. If problems of 
observer validity have been entertained, they have not been reported. 
No one seems to have squarely faced the factors involved in the 
validity of classroom observations: do they really measure what they 
propose to measure? 

The study has made an approach to answering the question by 
attempting to establish criterion validity of observations of 
teaching behavior. The training procedures used were regrettably less 
than effective, yet it seems a step in the right direction. The 
primary purpose of training should be to establish the validity of 
the observer; the data indicate that, with validity, within-observer 
reliability will follow. 